The data considered represents climatological data gathered from ten ocean buoys located in the western Atlantic, spanning the region between southern Bermuda and Puerto Rico. The data includes atmospheric and oceanic measurements. The buoys considered in this study are owned by NOAA and the National Weather Service; other buoys operated by different parties also exist in the same area. In operational studies, data from all of these buoys are considered, and the data are accessible to all concerned entities. For this project, we consider only the NOAA buoys, as their data are available to the public and contain sufficient information for the project objective. The following picture depicts what a buoy looks like.

The significance of this data comes from the buoys' location: they all sit in a region that is very active in terms of storms and hurricanes.
Each buoy's data is divided into separate files, where each file corresponds to a specific month. The period considered is January 2016 through September 2016. The features in the data include:
Based on the information it contains, the data is suitable for studying climatic changes near the buoys. It can also be used to monitor the evolution of a storm system, and it can serve as a good basis for evaluating buoy vandalism.
The downloaded data covers 10 buoys located from south of Bermuda to Puerto Rico in the western portion of the Atlantic. Each buoy's data consists of nine text files, one for each month of data collected from January to September of 2016. For each sensor, missing data is designated by a series of 9's, such as 99.0 or 9999.0. These values are converted into null values when the data is imported; otherwise, no other conditioning is done.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
%matplotlib inline
# Month enums
January = 1
February = 2
March = 3
April = 4
May = 5
June = 6
July = 7
August = 8
September = 9
def read_month(id, month, year):
    # Parse the five leading date/time columns (year, month, day, hour, minute)
    # into a single timestamp column.
    dateparse = lambda x: pd.datetime.strptime(x, '%Y %m %d %H %M')
    path = 'datasets/{}/{}{}{}.txt'.format(id, id, month, year)
    df = pd.read_csv(path,
                     delim_whitespace=True,
                     skiprows=[1],
                     parse_dates={'DATETIME': [0, 1, 2, 3, 4]},
                     date_parser=dateparse)
    df = clean_data(df)
    df['ID'] = id
    return df
def read_year(id, fromMonth, toMonth, year):
    bouy_data = []
    for month in range(fromMonth, toMonth + 1):
        bouy_data.append(read_month(id, month, year))
    return pd.concat(bouy_data)
# Removes missing values that are designated by a series of 9 values.
# Excerpt from NOAA Site: "Missing data in the Realtime files are denoted by "MM"
# while a variable number of 9's are used to denote missing data in the
# Historical files, depending on the data type (for example: 999.0 99.0)."
# Input:
#   df = dataframe to be cleaned
# Return:
#   df = cleaned dataframe
def clean_data(df):
    # Compare against the explicit NOAA sentinel values rather than
    # "max is divisible by 9", which would wrongly nullify legitimate
    # maxima such as 18.0 or 27.0.
    sentinels = {99.0, 999.0, 9999.0}
    for column in df.columns[1:]:
        max_val = df[column].max()
        if max_val in sentinels:
            df.loc[df[column] == max_val, column] = None
    return df
# data values
bouy_ids = [41002, 41040, 41041, 41043, 41044, 41046, 41047, 41048, 41049, 42059]
startMonth = 1   # January
endMonth = 9     # September
totalMonths = 9
year = 2016

bs = {id: read_year(id, startMonth, endMonth, year) for id in bouy_ids}

# Concatenate all the data into one DataFrame, reusing the frames read above
# instead of reading every file a second time.
bouy_data = pd.concat(bs.values())
def get_name(predictor):
    # Map sensor column codes to human-readable names.
    values = {
        'WDIR': 'Wind Direction',
        'WSPD': 'Wind Speed',
        'GST':  'Gust Speed',
        'WVHT': 'Wave Height',
        'DPD':  'Dominant Wave Period',
        'APD':  'Average Wave Period',
        'MWD':  'DPD Direction',
        'PRES': 'Sea Level Pressure',
        'ATMP': 'Air Temperature',
        'WTMP': 'Sea Surface Temperature',
        'DEWP': 'Dewpoint Temperature',
        'VIS':  'Station Visibility',
        'TIDE': 'Water Level'
    }
    return values[predictor]
The following are some basic data descriptors for the entire dataset.
print("Buoy data was downloaded for the following buoy IDs:\n")
for bouy in bouy_data['ID'].unique():
    print(bouy)
print("")
print("Date Range: ", bouy_data['DATETIME'].min(), ' to ', bouy_data['DATETIME'].max())
print("Total Number of Records: ", bouy_data.shape[0])
Descriptive Statistics by Buoy
pd.set_option('display.precision', 1)
for bouy in bouy_data['ID'].unique():
    display(HTML("<h2><center> Descriptive Statistics for Buoy ID: " + str(bouy) + "</center></h2>"))
    subset = bouy_data[bouy_data['ID'] == bouy][['WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD',
                                                 'PRES', 'ATMP', 'WTMP', 'DEWP', 'VIS', 'TIDE']].describe()
    print(subset[['WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD',
                  'PRES', 'ATMP', 'WTMP', 'DEWP']].loc[['mean', 'std', 'min', 'max']])
    ax = subset.loc['count'].plot(kind='bar', figsize=(10, 3), title="Buoy " + str(bouy))
    ax.set_xlabel("Sensors", fontsize=12)
    ax.set_ylabel("Recorded Measurements", fontsize=12)
    ax.set_ylim([0, 30000])
    plt.show()
The histograms of measurements recorded for each sensor show that, for a given buoy, the instruments do not all have the same number of recorded measurements. In addition, the buoys do not have the same number of measurements as each other: some buoys have approximately 25,000 data points, while others have only a little over 5,000. Finally, none of the buoys have records for the Visibility (VIS) or Tide sensors.
for sensor in ['WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD', 'PRES', 'ATMP', 'WTMP', 'DEWP']:
    fig, ax = plt.subplots(1, figsize=(10, 5))
    for bouy in bouy_data['ID'].unique():
        data = bouy_data[bouy_data['ID'] == bouy][sensor].dropna()
        ax.hist(data, bins=30, alpha=0.5, label='Buoy ' + str(bouy))
    ax.set_ylabel('Frequency')
    ax.set_xlabel(sensor)
    plt.title(get_name(sensor))
    plt.legend(loc='best')
    plt.show()
Looking at the histograms of each sensor across all buoys, we can learn a few things. For some sensors, such as Sea Level Pressure and Wave Height, the distribution of data follows a similar pattern across all buoys. Some sensor data appear roughly normally distributed, such as Average Wave Period and Sea Level Pressure. Other sensors, such as Sea Surface Temperature, do not seem to follow any common distribution, and their frequencies show little agreement across buoys.
Time Series Plotting
for sensor in ['WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD', 'PRES', 'ATMP', 'WTMP', 'DEWP']:
    display(HTML("<h2><center>" + get_name(sensor) + "</center></h2>"))
    fig, ax = plt.subplots(len(bouy_data['ID'].unique()), 1, figsize=(20, 25))
    # Use a common y-axis range so the buoys can be compared directly.
    max_val = bouy_data[sensor].max()
    min_val = bouy_data[sensor].min()
    index = 0
    for bouy in bouy_data['ID'].unique():
        data = bouy_data[bouy_data['ID'] == bouy][[sensor, 'DATETIME']]
        ax[index].plot(data['DATETIME'], data[sensor])
        ax[index].set_ylabel('Buoy ' + str(bouy))
        ax[index].set_ylim([min_val, max_val])
        index = index + 1
    plt.show()
The time series tell us a few things. Some measurements, like Wind Direction, have high variability over short periods of time, while others, like Sea Surface Temperature, change more gradually over time. It also appears that Buoys 41043, 41044, 41046, and 41047 had some sensors stop recording around May 2016, and Buoy 42059 had some issues early in the year. There are also some straight lines in the plots, which suggest data gaps: the sensor may have failed at that time, or may have been unable to transmit data to a base station.